/*
HEADER: CUG
TITLE: Parallel Pattern Matching and FGREP
VERSION: 1.00
DATE: 09/20/85
DESCRIPTION: "Development of algorithm used in FGREP, a full
emulation of Unix's 'fgrep' utility."
KEYWORDS: fgrep, grep, filter, UNIX, pattern-matching
SYSTEM:
FILENAME: FGREP.DOC
WARNINGS:
CRC: xxxx
SEE-ALSO: FGREP.C
AUTHORS: Ian Ashdown - byHeart Software
COMPILERS:
REFERENCES: AUTHORS: Bell Telephone Laboratories;
TITLE: UNIX Programmer's Manual Vol. 1, p. 166;
AUTHORS: A.V. Aho & M.J. Corasick;
TITLE: 'Efficient String Matching: An Aid to
Bibliographic Search'
Communications of the ACM
pp. 333 - 340, Vol. 18 No. 6 (June '75);
ENDREF
*/
/*-------------------------------------------------------------*/
Ian Ashdown
byHeart Software
1089 West 21st Street
North Vancouver
British Columbia V7P 2C6
Canada
Date: April 12th, 1985
Parallel Pattern Matching and FGREP
-----------------------------------
by Ian Ashdown
byHeart Software
*****************************************************************
*                                                               *
*  NOTE: This manuscript was published in the September 1985    *
*        issue of Dr. Dobb's Journal. It may be reproduced      *
*        for personal, non-commercial use only, provided that   *
*        the above copyright notice is included in all copies.  *
*        Copying for any other use without previously obtaining *
*        the written permission of the author is prohibited.    *
*                                                               *
*****************************************************************
Preparing an index to a large technical book. Finding all
references to a person in several years' worth of a company's
minutes of meetings. Searching for all occurrences of a specific
sequence of events in the volumes of data obtained from a
scientific experiment. All of these problems and many more are
examples of searching for patterns in a set of data. The question
is, what is the most efficient search algorithm to use?
For single patterns, such as searching for the phrase
"binary tree" in a text file, there are two of particular
interest: the Boyer-Moore Algorithm (reference 5) and the
Knuth-Morris-Pratt Algorithm (reference 6). Both are fast and
reasonably simple to implement. However, neither is appropriate
when more than one pattern at a time must be searched for.
One algorithm that is appropriate can be found in a paper
published in the June '75 issue of "Communications of the ACM"
(reference 1). Entitled "Efficient String Matching: An Aid to
Bibliographic Search" and written by Alfred Aho and Margaret
Corasick of Bell Laboratories, the paper presents "a simple and
efficient algorithm to locate all occurrences of any of a finite
number of keywords in a string of text". Without modification to
the algorithm, a "keyword" can be taken to mean any pattern, and
"a string of text" to mean any sequence of symbols, be it ASCII
text, digitized data obtained from a video camera, or whatever.
The algorithm has some interesting features. For
instance, most pattern-matching schemes must employ some form of
backtracking, rescanning part of the string whenever an attempted
match to a pattern fails or when more than one pattern is being
matched. The
Aho-Corasick algorithm differs in that it searches for all of the
patterns in parallel. By doing so, it can process the string in
one pass without any backtracking.
The algorithm also recognizes patterns that overlap in
the string. For example, if the patterns are "he", "she" and
"hers", the algorithm will correctly identify matches to all
three in the string "ushers".
As to speed of execution, it is independent of the number
of patterns to be matched! This perhaps surprising feature is a
direct result of the patterns being searched for in parallel.
Curiously, this algorithm has all but disappeared from
the literature of computer science. Of the hundreds of textbooks
and articles written since that discuss pattern matching, only a
few reference the paper, and none that I know of present the
Aho-Corasick algorithm itself. Unless you have access to back
issues of "Communications of the ACM", you will likely never
encounter it.
On the other hand, if you work with UNIX you may have
used a utility based on the algorithm: "fgrep". This useful tool
searches for and displays all occurrences of a set of keywords
and phrases in one or more text files. While it does not accept
wildcard characters in its patterns like UNIX's more flexible
"grep" utility, it does illustrate the speed of the Aho-Corasick
algorithm in comparison with other pattern matching methods.
Typically, "fgrep" is five to ten times faster than "grep" in
searching files for fixed string patterns.
Given its general usefulness, I thought it appropriate to
reintroduce Aho and Corasick's work in this article. At the same
time, it seemed a shame that "fgrep" has been restricted to the
UNIX operating system. Therefore, the source code in C for a full
implementation (reference 4) of "fgrep" has been included. This
serves not only to demonstrate the algorithm, but also to bring
the idea of a public domain UNIX one small step closer to
reality.
Inside The Machine
------------------
The Aho-Corasick algorithm consists of two phases:
building a "finite state automaton" (FSA for short), then running
a string of data through it, with each consecutive symbol being
considered a separate input. Before analyzing these phases, a
quick summary of what finite state automata are and how they work
is in order. (For a more detailed discussion, reference 2 is
recommended.)
In essence, an FSA is a conceptual machine that can be in
any one of a finite number of "states". It also has a set of
"state transition" rules and a set of "inputs", which together
with the set of states define what inputs cause the machine to
change from one state to another. Finally, since a machine serves
no purpose without producing some output, an FSA has one or more
"terminal" states. Output is produced only when the FSA enters
such states.
A physical example of an FSA could be a light switch.
This device has two states, ON and OFF. Its inputs consist of
someone pushing the switch UP or DOWN. ON is a terminal state,
where the associated lamp produces visible radiation. The state
transition rules can be stated in a simple truth table:
+------------------+
|    | OFF  |  ON  |
|----|------|------|
| UP |  ON  |  --  |
|DOWN|  --  | OFF  |
+------------------+
Alternatively, this could be shown as a "state transition
diagram":
            DOWN
       +-------------+
       |             |
       V             |
    +-----+  UP   ******
+-->| OFF |------>* ON *<---+
|   +-----+       ******    |
|      |             |      |
| DOWN |             | UP   |
+------+             +------+
where no state transition on a particular input (as shown in the
truth table) is equivalent to a transition from a state to itself
(as shown in the diagram).
Getting ahead of ourselves for a moment, look at Figure
1a, which shows part of an FSA (the other parts are shown in
Figures 1b and 1c, and will be explained shortly). By having
sequences of states, it is possible for the FSA to recognize
patterns in an input stream. For example, presenting the string
"she" to the FSA shown in Figure 1a would cause it to change from
State 0 to States 3, 4 and 5 as each input symbol of the string
is processed. State 5 is a terminal state that indicates the
pattern "she" was recognized.
There are two types of FSA that can be built, one
"nondeterministic" (Algorithm 1) and the other "deterministic"
(Algorithm 2). The nondeterministic one (NFSA for short) uses
functions called "go_to", "failure" and "output", while the
deterministic FSA (DFSA) uses "move" and "output". The difference
between the two is that while the NFSA can make one or more state
transitions per input symbol, the DFSA makes only one. With fewer
transitions to make, a DFSA can process its input in less time
than an equivalent NFSA.
The "go_to" function accepts as input parameters the
current state of th